import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
plt.rcParams['figure.figsize'] = (10, 10)
kakamana
January 21, 2023
You will learn how to adjust XGBoost’s parameters and how to tune them efficiently so that you can supercharge the performance of your models.
This Fine-tuning your XGBoost model post is part of the Datacamp course Extreme Gradient Boosting with XGBoost.
This is my learning experience of data science through DataCamp.
Now that you’ve seen the effect that tuning has on the overall performance of your XGBoost model, let’s turn the question on its head and see if you can figure out when tuning your model might not be the best idea.
Let’s start with parameter tuning by seeing how the number of boosting rounds (the number of trees you build) impacts the out-of-sample performance of your XGBoost model. You’ll use xgb.cv() inside a for loop and build one model per num_boost_round parameter.
Here, you’ll continue working with the Ames housing dataset. The features are available in the array X, and the target vector is contained in y.
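The code below assumes X and y were created earlier in the notebook. A minimal sketch of that setup, assuming a preprocessed Ames housing CSV with the target (sale price) in the last column (the file name here is illustrative, not from the original post):
# Assumed setup (illustrative file name): load the preprocessed Ames housing data
ames = pd.read_csv('ames_housing_trimmed_processed.csv')
X, y = ames.iloc[:, :-1], ames.iloc[:, -1]   # feature matrix and target vector (sale price)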
# Create the housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":3}
# Create list of number of boosting rounds
num_rounds = [5, 10, 15]
# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []
# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=curr_num_rounds, metrics='rmse',
                        as_pandas=True, seed=123)
    # Append final round RMSE
    final_rmse_per_round.append(cv_results['test-rmse-mean'].tail().values[-1])
# Print the result DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses, columns=['num_boosting_rounds', 'rmse']))
print("\nAs you can see, increasing the number of boosting rounds decreases the RMSE.")
num_boosting_rounds rmse
0 5 50903.299752
1 10 34774.194090
2 15 32895.099185
As you can see, increasing the number of boosting rounds decreases the RMSE.
Now, instead of attempting to cherry-pick the best possible number of boosting rounds, you can very easily have XGBoost automatically select the number of boosting rounds for you within xgb.cv(). This is done using a technique called early stopping.
Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (50). Bear in mind that if the holdout metric continuously improves up through when num_boost_round is reached, then early stopping does not occur.
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}
# Perform cross-validation with early-stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, nfold=3, params=params, metrics="rmse",
early_stopping_rounds=10, num_boost_round=50, as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
0 141871.635216 403.633062 142640.653507 705.559723
1 103057.033818 73.768079 104907.664683 111.117033
2 75975.967655 253.727043 79262.056654 563.766693
3 57420.530642 521.658273 61620.137859 1087.693428
4 44552.956483 544.170426 50437.560906 1846.446643
5 35763.948865 681.796675 43035.659539 2034.471115
6 29861.464164 769.571418 38600.880800 2169.796804
7 25994.675122 756.520639 36071.817710 2109.795408
8 23306.836299 759.237848 34383.186387 1934.547433
9 21459.770256 745.624640 33509.140338 1887.375358
10 20148.721060 749.612186 32916.806725 1850.893437
11 19215.382607 641.387200 32197.833474 1734.456654
12 18627.388962 716.256240 31770.852340 1802.154296
13 17960.695080 557.043324 31482.782172 1779.124406
14 17559.736640 631.413137 31389.990252 1892.320326
15 17205.713357 590.171774 31302.883291 1955.165882
16 16876.571801 703.631953 31234.058914 1880.706205
17 16597.662170 703.677363 31318.347820 1828.860754
18 16330.460661 607.274258 31323.634893 1775.909992
19 16005.972387 520.470815 31204.135450 1739.076237
20 15814.300847 518.604822 31089.863868 1756.022175
21 15493.405856 505.616461 31047.997697 1624.673447
22 15270.734205 502.018639 31056.916210 1668.043691
23 15086.381896 503.913078 31024.984403 1548.985086
24 14917.608289 486.206137 30983.685376 1663.131135
25 14709.589477 449.668262 30989.476981 1686.667218
26 14457.286251 376.787759 30952.113767 1613.172390
27 14185.567149 383.102597 31066.901381 1648.534545
28 13934.066721 473.465580 31095.641882 1709.225578
29 13749.644941 473.670743 31103.886799 1778.879849
30 13549.836644 454.898742 30976.084872 1744.514518
31 13413.484678 399.603422 30938.469354 1746.053330
32 13275.915700 415.408595 30931.000055 1772.469405
33 13085.878211 493.792795 30929.056846 1765.541040
34 12947.181279 517.790033 30890.629160 1786.510472
35 12846.027264 547.732747 30884.493051 1769.728787
36 12702.378727 505.523140 30833.542124 1691.002007
37 12532.244170 508.298300 30856.688154 1771.445485
38 12384.055037 536.224929 30818.016568 1782.785175
39 12198.443769 545.165604 30839.393263 1847.326671
40 12054.583621 508.841802 30776.965294 1912.780332
41 11897.036784 477.177932 30794.702627 1919.675130
42 11756.221708 502.992363 30780.956160 1906.820178
43 11618.846752 519.837483 30783.754746 1951.260120
44 11484.080227 578.428500 30776.731276 1953.447810
45 11356.552654 565.368946 30758.543732 1947.454939
46 11193.557745 552.298986 30729.971937 1985.699239
47 11071.315547 604.090125 30732.663173 1966.997252
48 10950.778492 574.862853 30712.241251 1957.750615
49 10824.865446 576.665678 30720.853939 1950.511037
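Early stopping did not fire in this run: the best mean test RMSE occurs at round 48, so there were never 10 consecutive rounds without improvement, and all 50 rounds are shown. As a small sketch of my own (not part of the original exercise), note that xgb.cv truncates its result DataFrame at the best iteration when early stopping does fire, so you can read off what was kept directly:
# Sketch: inspect what xgb.cv kept (the DataFrame is truncated if early stopping fires)
print("Boosting rounds kept:", len(cv_results))
print("Best mean test RMSE:", cv_results["test-rmse-mean"].min())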
It’s time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You’ll begin by tuning "eta", also known as the learning rate.
The learning rate in XGBoost is a parameter that can range between 0 and 1; lower values of "eta" shrink the feature weights of each new tree more strongly, making the boosting process more conservative (stronger regularization).
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}
# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []
# Systematically vary the eta
for curr_val in eta_vals:
    params['eta'] = curr_val
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,
                        as_pandas=True)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])
# Print the result DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=['eta', 'best_rmse']))
eta best_rmse
0 0.001 195736.402543
1 0.010 179932.183986
2 0.100 79759.411808
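Keep in mind that only 10 boosting rounds were allowed here, which heavily handicaps the smaller learning rates: a small eta shrinks each tree's contribution, so it usually needs many more rounds to reach a comparable RMSE. As a sketch of my own (not part of the exercise), pairing eta=0.01 with a larger round budget and early stopping:
# Illustration only: a small learning rate typically needs more boosting rounds
params = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.01}
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                    num_boost_round=500, early_stopping_rounds=10,
                    metrics='rmse', as_pandas=True, seed=123)
print(cv_results['test-rmse-mean'].tail(1))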
In this exercise, your job is to tune max_depth, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {"objective":"reg:squarederror"}
# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []
for curr_val in max_depths:
    params['max_depth'] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        early_stopping_rounds=5, num_boost_round=10, metrics='rmse', seed=123,
                        as_pandas=True)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results['test-rmse-mean'].tail().values[-1])
# Print the result DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)), columns=['max_depth', 'best_rmse']))
max_depth best_rmse
0 2 37957.469464
1 5 35596.599504
2 10 36065.547345
3 20 36739.576068
Now, it’s time to tune "colsample_bytree". You’ve already seen something similar if you’ve ever worked with scikit-learn’s RandomForestClassifier or RandomForestRegressor, where it is called max_features. Although they are named differently, both parameters restrict the fraction of features the model can choose from; in xgboost, colsample_bytree samples the columns once per tree (rather than at every split) and must be specified as a float between 0 and 1.
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary
params = {"objective":"reg:squarederror", "max_depth":3}
# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []
# Systematically vary the hyperparameter value
for curr_val in colsample_bytree_vals:
    params['colsample_bytree'] = curr_val
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)),
                   columns=["colsample_bytree","best_rmse"]))
print("\nThere are several other individual parameters that you can tune, such as `'subsample'`, which dictates the fraction of the training data that is used during any given boosting round. Next up: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!")
colsample_bytree best_rmse
0 0.1 40918.116895
1 0.5 35813.904168
2 0.8 35995.678734
3 1.0 35836.044343
There are several other individual parameters that you can tune, such as `'subsample'`, which dictates the fraction of the training data that is used during any given boosting round. Next up: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!
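As a sketch of my own (not shown in the original exercise), 'subsample' takes a float between 0 and 1 and can be tuned with exactly the same loop pattern used above:
# Illustration: tune 'subsample', the fraction of training rows used per boosting round
params = {"objective": "reg:squarederror", "max_depth": 3}
subsample_vals = [0.3, 0.6, 0.9, 1.0]
best_rmse = []
for curr_val in subsample_vals:
    params['subsample'] = curr_val
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
# Print the result DataFrame
print(pd.DataFrame(list(zip(subsample_vals, best_rmse)), columns=["subsample", "best_rmse"]))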
Now that you’ve learned how to tune parameters individually with XGBoost, let’s take your parameter tuning to the next level by using scikit-learn’s GridSearchCV and RandomizedSearchCV, which search over combinations of parameter values with internal cross-validation. You will use these to find the best model from a collection of possible parameter values across multiple parameters simultaneously (exhaustively, in the case of grid search). Let’s get to work, starting with GridSearchCV!
from sklearn.model_selection import GridSearchCV
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()
# Perform grid search: grid_mse
grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm,
                        scoring='neg_mean_squared_error', cv=4, verbose=1)
# Fit grid_mse to the data
grid_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Fitting 4 folds for each of 4 candidates, totalling 16 fits
Best parameters found: {'colsample_bytree': 0.3, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found: 28986.18703093561
Often, GridSearchCV can be really time consuming, so in practice you may want to use RandomizedSearchCV instead, as you will do in this exercise. The good news is that you only have to make a few modifications to your GridSearchCV code to do RandomizedSearchCV. The key difference is that you have to specify a param_distributions parameter instead of a param_grid parameter.
from sklearn.model_selection import RandomizedSearchCV
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}
# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)
# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm,
                                    scoring='neg_mean_squared_error', n_iter=5, cv=4,
                                    verbose=1)
# Fit randomized_mse to the data
randomized_mse.fit(X, y)
# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best parameters found: {'n_estimators': 25, 'max_depth': 4}
Lowest RMSE found: 29998.4522530019
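Since both search objects refit the best parameter combination on the full data by default (refit=True), you can pull out the fitted model and generate predictions with it; a short sketch of my own using the randomized_mse object from above:
# The best estimator is refit on all of X, y when refit=True (the default)
best_model = randomized_mse.best_estimator_
preds = best_model.predict(X)
print("Predictions for the first five houses:", preds[:5])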